Finding Sequence Clusters: A Shared Near Neighbors Approach
Authors
Abstract
Sequence clustering is one of the most fundamental topics and can be applied in various research fields. Most previous work on sequence clustering is dedicated to single-label clustering, in which the overall similarity of equal-length sequences is considered and measured by the Euclidean distance function. However, the intrinsic properties of sequences demand multi-label clustering. In addition, the Euclidean distance in high-dimensional space introduces the curse of dimensionality. Therefore, in this paper, we employ the concept of shared near neighbors (SNN) for sequence similarity and integrate it into the multi-label clustering process. Given a set of sequences, our approach first applies the sliding window technique and the DCT mapping to the sequences to obtain feature vectors. Those feature vectors, associated with the SNN similarity, are then grouped by applying graph-based clustering and hierarchical clustering, respectively. We also design a validity measure and perform experiments to show the efficiency and effectiveness of our approach. In addition, the feature vectors are approximated by minimum bounding rectangles (MBRs). Because there are fewer MBRs than feature vectors, the computational complexity can be reduced accordingly without compromising clustering validity.
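The feature extraction and similarity steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the unnormalized DCT-II, the window width, the number of retained coefficients, and the definition of SNN similarity as the plain count of shared k-nearest neighbors are all assumptions made for illustration.

```python
import numpy as np

def sliding_windows(seq, width, step=1):
    """Return all length-`width` windows of a 1-D sequence as rows."""
    seq = np.asarray(seq, dtype=float)
    starts = range(0, len(seq) - width + 1, step)
    return np.array([seq[s:s + width] for s in starts])

def dct_features(windows, n_coeffs):
    """Map each window to its first `n_coeffs` DCT-II coefficients.

    Assumption: unnormalized DCT-II; the paper's exact mapping may differ.
    """
    width = windows.shape[1]
    n = np.arange(width)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi / width * (n + 0.5) * k)  # shape (n_coeffs, width)
    return windows @ basis.T

def snn_similarity(features, k):
    """Count the neighbors shared by the k-nearest-neighbor lists of each pair.

    Assumption: similarity is the raw shared-neighbor count, with no
    mutual-neighbor requirement or weighting.
    """
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    knn = np.argsort(dists, axis=1)[:, :k]
    member = np.zeros(dists.shape, dtype=int)
    np.put_along_axis(member, knn, 1, axis=1)  # member[i, j] = 1 if j is in i's k-NN list
    return member @ member.T  # entry (i, j) = size of the k-NN list intersection

# Hypothetical usage on a toy sequence
windows = sliding_windows(np.sin(np.linspace(0, 6, 40)), width=8)
features = dct_features(windows, n_coeffs=3)
similarity = snn_similarity(features, k=5)
```

The resulting similarity matrix could then serve as the edge weights of a graph or as input to a hierarchical clustering routine, which is where the two clustering variants mentioned in the abstract would take over.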
Similar resources
A New Shared Nearest Neighbor Clustering Algorithm and its Applications
Clustering depends critically on density and distance (similarity), but these concepts become increasingly more difficult to define as dimensionality increases. In this paper we offer definitions of density and similarity that work well for high dimensional data (actually, for data of any dimensionality). In particular, we use a similarity measure that is based on the number of neighbors that t...
Genetic Structure of SSR1 & SSR2 loci from Iranian Mycobacterium Avium Subspecies Paratuberculosis Isolates by a Short Sequence Repeat Analysis Approach
Abstract Background and Objective: Paratuberculosis has been repeatedly reported from Iranian ruminant herds. The extremely fastidious nature of Mycobacterium avium subspecies paratuberculosis hinders genomic diversity studies of the pathogen. Short Sequence Repeat analysis is one of the genome-based approaches recently developed to overcome this d...
Novel Application of Near-infrared Spectroscopy and Chemometrics Approach for Detection of Lime Juice Adulteration
The aim of this study is to investigate the novel application of a handheld near infra-red spectrophotometer coupled with classification methodologies as a screening approach in detection of adulterated lime juices. For this purpose, a miniaturized near infra-red spectrophotometer (Tellspec®) in the spectral range of 900–1700 nm was used. Three diffuse reflectance spectra of 31 pure...
Clustering Using a Similarity Measure Based on Shared Near Neighbors
A nonparametric clustering technique incorporating the concept of similarity based on the sharing of near neighbors is presented. In addition to being an essentially parallel approach, the computational elegance of the method is such that the scheme is applicable to a wide class of practical problems involving large sample size and high dimensionality. No attempt is made to show how a priori pr...
Journal title:
- J. Inf. Sci. Eng.
Volume 31, Issue
Pages -
Publication date: 2015